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Abstract. In this work, we have explored the advantages and drawbacks of using 
GPUs instead of CPUs in the calculation of a standard 2-point correlation function 
algorithm, which is useful for the analysis of Large Scale Structure of galaxies. Taking 
into account the huge volume of data foreseen in upcoming surveys, our main goal has 
been to accelerate significantly the analysis codes. We find that GPUs offer a 100-fold 
increase in speed with respect to a single CPU without a significant deviation in the 
results. For comparison's sake, an MPI version was developed as well. Some issues, 
like code implementation, which arise from using this option are discussed. 



1. Introduction 

The two-point correlation function (2pcf) is a simple statistic that quantifies the clus- 
tering of a given distribution of objects. In studies of the Large Scale Structure (LSS) 
of the Universe, this is an important tool containing information ab out the matter c lus- 
tering and the Universe evolution at different cosmological epochs, Peebles! (1 1980h . A 
classical application of this statistic is the gala xy-galaxy correlation function to find 
constraints on the matter density paramete r O m . lHawkins et ail d2003h . or the location 



of the baryonic acoustic oscillation peak, ISanchez et all ( 201 lh . Other examples in- 
clude cross-correlation of background gala xies with the shear of o bjects caused by the 
gravitational effect on light (weak lensing), iDodelson et all (120081) . 

The 2pcf measures the excess probability of finding a couple of galaxies separated 
by spatial distance r or angular distance 9 with respect to the probability of finding a 
couple of galaxies separated by the same distance or angle in a random and uniform 
distribution. In this work we have used the angular version of the correlation function 
w(9) though results ar e extendible to the 3 -dim ensional variant as well. 

Landy & Szalay, lLandv&SzaIavl(ll993h . found an estimator with minimum vari- 
ance which is the standard one used in cosmological analyses: 

, 1 . / NrarKlom \Q DD(0) r, ( N ran( jom \ DR(0) / 1 \ 

wW " 1 + { ^ZT )Z ' Rim ~ 1 ' ( -~auT- ) ' RW) W 

where N ga i is the number of galaxies in a real catalog, N r d is the number of galaxies 
in a random catalog, DD{9) is the number of pairs separated by an angular distance 9 in 
the real catalog, RR{9) is the number of pairs separated by an angular distance 9 in the 
random catalog and DR{9) is the number of pairs separated by an angular distance 9 in 
the real catalog with respect to the random catalog. 
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2. Computational problem and previous work 



The calculation of 2pcf, Eq{TJ is very costly computationally so alterna tive strategies 
have b een desi gned to approach th e problem (pixelization of the map, Eriksen et al. 
(I2004h . t-trees. lMoore et all (l2Q00h), us ually at the cost of some loss of information. 

Alternatively, in iRoeh et al.l (120091) . this problem has been treated with GPUs us- 
in g a differen t stra tegy in terms of shared memory usage. In particular, the authors 
of Roeh et al. (2009) have used a 'chessboard' strategy where arrays are passed to the 
global memory. This has the disadva ntage of haying restrictions in the input sample. 
Also, the particular implementation in lRoeh et al.l ( 2009h obtained results in cosO space, 
thus complicating the cosmological interpretation of the result. 



3. Implementation and hardware 



We have implemented in CUDA the Landy-Szalay estimator with the following key 
features: 

• Usage of shared memory (instead of global memory) for the dot product and 
arc-cosine operations necessary to extract the angle between two objects. 

• Application of atomic operations in shared memory to make use efficiently of 
multi-threading when filling up the histograms (DD, DR and RR in Eq.Q}. Partial 
histograms are generated in parallel in shared memory and later combined in a 
single histogram, in global memory. 

• In one of the architectures we had available, we applied a multi-GPU solution 
using 3 GPUs, one for each of the histograms, in which DD and RR where used in 
one of the boards containing 2 GPUs and DR in the other for maximum efficiency. 



A full description of the algorithm and its implementation can be found in lCardenas-Montes et al 



(201 1). The hardware we have used to test our codes is in Tabled] 



CPU 


GPU 


MPI 


CPU with two Intel 
Xeon E5520 processors 
at 2.27 GHz 


GTX295 
CI 060 (Tesla) 
C2050 (Tesla) 


1920 cores (two 
Intel Xeon E5570 
at 2.93 GHz, per node) 



Table 1 . Hardware specifications that we have used. 



4. Results and analysis 



The g al axy catalogs used a re publicly available from the MICE project. iFosalba et al. 
(120081) : ICrocce et all (1201 Oh . 

In Table [2] we present a comparison between the execution time of CPU imple- 
mentation and the execution time of GPU implementation. 

In Fig. |l(a)| we show, for MICE catalog, one of the correlation functions calculated 
using this code, versus the same calculation using a standard implementation in C for 
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Input file lines 


CPU (s) 


GTX295 (s) 


CI 060 (s) 


C2050 (s) 


0.43 • 10 b 


3.60 • 10 4 


3.01 • 10 2 


2.91 • 10 2 


2.19- 10 2 


0.86 • 10 b 


1.44 • 10 5 


1.20- 10 3 


1.16- 10 3 


8.76 • 10 2 


1.00- 10 b 


1.98 ■ 10 5 


1.61 • 10 3 


1.56- 10 3 


1.17 • 10 3 


1.29- 10 b 


3.24 ■ 10 5 


2.68 • 10 3 


2.59 • 10 3 


1.97 • 10 3 


1.72- 10 b 


5.76 • 10 5 




4.64 • 10 3 


3.51 • 10 3 


3.45 • 10 b 


2.32 • 10 b 




1.88 • 10 4 


1.41 • 10 4 


6.89 • 10 b 


9.22 ■ 10 b 




7.45 • 10 4 


5.61 • 10 4 



Table 2. Comparison between CPU execution time and diverse GPUs execution time. 



CPUs, for reference. The residuals at each point are plotted in Fig. |l(b)| and are far 
below the expected errors due to cosmic variance, i.e., the statistical errors due to the 
small number of 'fields' available in the sky. 



Comparison between CPU and GPU Correlation Function 



WGPU-WCPU Histogram 



CPU results 
GPU results 




Figure 1 . Panel (a, left) shows a comparison between correlation functions, the 
red one was calculated with the CPU code, the green one with the GPU code, while 
panel (b, right) shows the residuals between GPU and CPU codes. These residuals 
are really small and fall into the statistical errors. 



We have also done a comparison between GPUs and MPI. In Fig. [2] we have our 
MPI time with GPUs time like a boxplot graphic. 



5. Conclusions 

We have developed an implementation of the Landy-Szalay two-point correlation func- 
tion in CUDA to make use of the power GPUs have to offer in terms of parallelization. 
The speed-up with respect to a CPU is 164-fold (C2050) using the same algorithm. 
With respect to an implementation of k-trees in CPUs we obtain an increase of 6.4-fold 
(for 0.43 • 10 b objects). Several MPI configurations have been explored being the GPU 
implementation surpassed by the usage of more than 64 nodes, see Fig. [2] 

Some options to be explored remain, such as full-blown multi-GPU implementa- 
tion, coding the k-trees or extending the work to higher order correlation functions, for 
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Execution Time Results 
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Figure 2. GPU and MPI execution time results. 

other types of cosmological analyses such as understanding non-Gaussianities in the 
primordial perturbations. 
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